Extracting Human Goals from Weblogs
Abstract
Knowledge about human goals has been found to be an important kind of knowledge for a range of challenging problems, such as goal recognition from people's actions or reasoning about human goals. Necessary steps towards conducting such complex tasks involve (i) acquiring a broad range of human goals and (ii) making them accessible by structuring and storing them in a knowledge base. In this work, we focus on extracting goal knowledge from weblogs, a largely untapped resource that can be expected to contain a broad variety of human goals. We annotate a small sample of weblogs and devise a set of simple lexico-syntactic patterns that indicate the presence of human goals. We then evaluate the quality of our patterns by conducting a human subject study. The resulting precision values favor patterns that are not merely based on part-of-speech tags. In future steps, we intend to improve these preliminary patterns based on our observations.

1 Knowledge about Human Goals

Knowledge about human goals has been found to be an important kind of knowledge for a range of challenging research problems, such as goal recognition from people's actions, reasoning about people's goals, or the generation of action sequences that implement goals (planning) [Schank and Abelson, 1977]. In contrast to other kinds of knowledge, e.g. commonsense knowledge, knowledge about human goals provides a different perspective on textual resources, putting more emphasis on future aspects and activities. We regard the acquisition of this knowledge as a first step towards conducting complex tasks such as planning.

Regardless of whether the knowledge to extract is about human goals, commonsense [Liu and Singh, 2004] or the world in general [Schubert and Tong, 2003; Clarke, 2009], the acquisition process often includes the application of indication and extraction patterns. Moreover, knowledge acquisition approaches differ in how much manual intervention is necessary (or desired) in the acquisition process. Existing approaches include manual knowledge engineering [Lenat, 1995] as well as volunteer-based [Liu and Singh, 2004], game-based [Lieberman et al., 2007; von Ahn, 2006] and semi-automatic approaches [Eslick, 2006]. In this paper, however, we are interested in the question of how knowledge about human goals can be automatically derived from social media text, in our case weblogs. To give an example, consider the following snippet of a blog post containing several human goals:

"Last September, we moved into our new home. I had plans for this home, the first house -- not apartment -- my husband and I would live in. I was going to refinish some hand-me-down furniture we have, and I was going to plant a wonderful garden, starting with bulbs that would bloom in the spring. Crocuses, hyacinths, tulips -- all of my favorites. And I would know, all winter long, that they were sleeping in the dark, cold soil, waiting to awake with the first light and warmth of spring."

Though weblogs exhibit some disadvantages when it comes to quality, e.g. their textual content is prone to noise, we can expect them to contain a broad variety of human goals. In the remainder of this paper, we describe our approach to goal extraction from open text by deriving and evaluating a first set of lexico-syntactic patterns. We then discuss strengths and weaknesses of our patterns based on a small human subject study, in order to improve them in future steps.
2 Patterns To Extract Human Goals

We employ and adapt the definition of [Tatu, 2005], who defines human goals as "expressions of a particular action that shall take place in the future, in which the speaker is some sort of agent." The following exemplary sentence, taken from the blog post snippet presented above, "I was going to refinish some hand-me-down furniture", indicates the person's intention to prettify some furniture. In contrast to Tatu's definition, we do not include information about the speaker in our patterns in order to keep them simple. The idea of including this kind of information is, however, discussed in Section 3.3.

When comparing our setup to that of [Tatu, 2005], we observe three differences. Firstly, the author developed part-of-speech patterns by annotating and examining samples from the Brown corpus. Working on the Brown corpus is advantageous because this corpus has already been tagged, which reduces the chance of incorrect part-of-speech tags. Secondly, the language used in the Brown corpus differs from the language used in weblogs. Thirdly, [Tatu, 2005]'s motivation was to address challenges in question answering (QA); she expected that sentences containing expressions of human goals are better suited to answer a certain kind of question. Textual resources in the QA domain exhibit characteristics different from those of weblogs; for instance, people use weblogs to tell stories or write diary-like entries. We hypothesize that extraction patterns yield different results depending on weblog characteristics, e.g. whether or not a weblog contains a story-like structure.

We followed a common path to acquiring knowledge from textual resources by manually examining the textual environment to identify appropriate patterns [Hearst, 1992]. As a first step, we drew a small, random sample (~100 blog posts) from the ICWSM 2009 Spinn3r Dataset [Burton et al., 2009] and annotated the textual contents according to the above definition. The annotation task was conducted by one of the authors and an undergraduate student. Table 1 lists the ten resulting lexico-syntactic patterns based on these annotations, which are partly inspired by the patterns of [Tatu, 2005]; she employs her patterns to identify sentences containing intentional expressions in order to build up a training set for further experiments. Part-of-speech tags throughout this paper are consistent with the Penn Treebank Tag Set.

Table 1: Lexico-syntactic patterns to identify and extract human goals and matching instances. (*) denotes no, one or several occurrences, (+) denotes at least one occurrence, (?) denotes one optional occurrence and (|) denotes a logical OR.

Nr. 1:  needs/VBZ to/TO organize/VB
Nr. 2:  alcohol/NN to/TO get/VB
Nr. 3:  available/JJ to/TO read/VB
Nr. 4:  find/VB a/DT keyboard/NN
Nr. 5:  wanted/VBD to/TO kill/VB
Nr. 6:  intend/VBP to/TO quit/VB
Nr. 7:  * goal/NN is/VBZ to/TO eat/VB
Nr. 8:  like/VB to/TO share/VB
Nr. 9:  <PRP> wants/VBZ them/PRP to/TO go/VB
Nr. 10: ? | get/VB you/PRP to/TO purchase/VB
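The patterns in Table 1 are applied to part-of-speech tagged text. The following minimal sketch, written in Python with NLTK (the toolkit we use in Section 3.2), illustrates how a simplified verb + to/TO + verb pattern, in the spirit of the instances matched by patterns Nr. 1 and Nr. 5, could be matched against a tagged sentence. The regular expression, the helper function and the example sentence are illustrative assumptions and not the exact patterns from Table 1.

# A minimal sketch (not the exact implementation) of matching a simplified
# verb + to/TO + verb pattern against a part-of-speech tagged sentence.
import re
import nltk

# Required NLTK resources, downloaded once, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def tag_sentence(sentence):
    """Return the sentence as a single 'word/TAG word/TAG ...' string."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # Penn Treebank tags
    return " ".join("%s/%s" % (word, tag) for word, tag in tagged)

# Any verb form (VB, VBD, VBG, VBN, VBP, VBZ), followed by 'to/TO',
# followed by a base-form verb, e.g. "wanted/VBD to/TO kill/VB".
VERB_TO_VERB = re.compile(r"\S+/VB[DGNPZ]? to/TO \S+/VB\b")

tagged = tag_sentence("I was going to refinish some hand-me-down furniture.")
for match in VERB_TO_VERB.finditer(tagged):
    print(match.group())  # e.g. "going/VBG to/TO refinish/VB"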
In the next section, we apply our extraction patterns to a larger sample of weblogs. We then evaluate the quality of every pattern by calculating precision values.

3 Quality & Characteristics

In this section, we briefly describe our data preparation steps and the pattern matching process. We report precision results of a preliminary study on a set of ~205,000 blog posts and discuss observed weaknesses of our patterns. We conclude the section by suggesting several possibilities to improve and extend the patterns for extracting knowledge about human goals.

3.1 Data Sets

For our experiments, we used the ICWSM 2009 Spinn3r Dataset, which comprises 44 million blog posts made between August 1 and October 1, 2008. We randomly drew ~205,000 blog posts and separated them into two datasets: one with posts containing stories and one with posts containing non-stories. We hypothesize that blog posts telling a story contain more human goals than other blog posts. We rely on work by [Gordon and Reid, 2009], who define a story as a series of causally related events in the past. They developed an automatic algorithm to identify blog posts that most likely contain a story (with reported precision values of up to 75%). Moreover, they provide an index of all blog posts in the ICWSM 2009 Spinn3r Dataset that were classified as containing story-like structures. Using this information, we obtained two datasets: one containing posts with stories (~3,000) and one containing posts without stories (~202,000).

3.2 Data Preparation

We first extracted the textual content from the corresponding XML files of the random sample. Since the textual content of the weblogs was often messy, we had to clean it in preparation for the subsequent part-of-speech tagging. The cleaning procedure included removing HTML snippets and special characters. For part-of-speech tagging and pattern matching, we used the functionality of the Natural Language Toolkit (NLTK, http://www.nltk.org/) in combination with Python as the programming language.
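As an illustration of this preparation step, the following sketch shows how a post's content could be cleaned before tagging. It is a minimal example; the regular expressions and the sample post are our own illustrative assumptions rather than the exact cleaning procedure.

# A minimal sketch of the cleaning step described in Section 3.2, assuming the
# post content has already been extracted from the XML files.
import re

def clean_post(raw_text):
    """Remove HTML snippets and special characters from a blog post."""
    text = re.sub(r"<[^>]+>", " ", raw_text)            # drop html tags
    text = re.sub(r"&[a-zA-Z]+;|&#\d+;", " ", text)     # drop html entities
    text = re.sub(r"[^A-Za-z0-9.,!?'\- ]", " ", text)   # drop special characters
    return re.sub(r"\s+", " ", text).strip()            # normalize whitespace

raw = "<p>I was going to plant a wonderful garden &amp; a row of tulips...</p>"
print(clean_post(raw))
# -> "I was going to plant a wonderful garden a row of tulips..."
# The cleaned text is then part-of-speech tagged with NLTK before the patterns
# from Table 1 are applied.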
3.3 Strengths and Weaknesses of our Goal Extraction Patterns

We applied the patterns from Table 1 to the two datasets described in Section 3.1, which were randomly drawn from the ICWSM 2009 Spinn3r Dataset (tier groups 1-3). Table 2 shows the number of matches per extraction pattern. The frequency numbers corroborate our hypothesis that there is a higher potential for the presence of human goals in weblogs containing a story. Since there are ~67 times more blog posts containing non-stories than stories, the raw frequencies are not directly comparable. In order to compare them, we calculate the ratio of the number of found goal instances to the number of blog posts. This ratio is consistently in favor of the blog posts containing stories; consider, for example, the ratios for the first pattern: 486/3,000 = 0.16 for stories vs. 6,220/202,000 = 0.03 for non-stories.

Table 2: Number of matched goal instances per extraction pattern as well as precision values (sample size of 20) for both story and non-story content. Rows correspond to patterns Nr. 1-10 from Table 1.

Nr.   Story Set (#3,000)      Non-Story Set (#202,000)
      Freq.    Prec.          Freq.    Prec.
1     486      0.1            6220     0
2     2018     0              6661     0
3     677      0.05           5424     0.06
4     1405     0.06           15129    0
5     398      0.53           3614     0.37
6     10       0.6            86       0.5
7     2        0.5            39       0.82
8     36       0.16           592      0.18
9     30       0.83           291      0.11
10    16       0.25           47       0.32

For every pattern, an undergraduate student rated a maximum of 40 matched instances as to whether a human goal is expressed or not. The student took the context (sentence boundary) into account when rating the matched instances. The precision values for every pattern are calculated from 20 instances from story content and 20 instances from non-story content. In five cases, where a pattern matched fewer than 20 instances, the precision values are based on a slightly lower number of rated samples.

To discuss strengths and weaknesses, we group our patterns into three categories and provide positively and negatively rated instances per pattern category, i.e. true positives and false positives. The first category (Nr. 1 to 4) contains pure part-of-speech patterns, the second category (Nr. 5 to 8) contains part-of-speech patterns combined with goal keywords, and the patterns of the third category (Nr. 9 and 10) can be expected to extract not only goal knowledge but also additional information about the participants involved.

We observe that the precision values in the first category are low. Though these pure part-of-speech patterns produce many matches, the matched instances appear too general and are therefore inappropriate for extracting human goals. Moreover, positive examples are partly matched by patterns from other categories as well; "want him to learn to ride a bike", for instance, serves as a positive example for both pattern Nr. 1 and pattern Nr. 4. Table 3 shows true and false positives extracted by these patterns.

Table 3: True and false positives of extracted human goals (patterns Nr. 1-4).

Matched Instance             | Context                                    | Goal
learn/VB to/TO ride/VB       | that want him to learn to ride a bike.     | yes
have/VB to/TO agree/VB       | I might have to agree on some levels       | no
car/NN to/TO go/VB           | We got in the car to go to the hospital    | no
things/NNS to/TO load/VB     | I just have a few more things to load      | no
willing/JJ to/TO believe/VB  | I am willing to believe in love            | yes
ready/JJ to/TO take/VB       | I was ready to take that chance with you   | no
ride/VB a/DT bike/NNP        | that want him to learn to ride a bike      | yes
take/VB another/DT night/NN  | I can't take another night of this         | no

Almost all patterns in the second category achieved a precision value higher than 50%, with two exceptions. The first is pattern Nr. 8. When reviewing the matched instances, we find sentences such as "She swims a lot and likes to drink lake water." (see Table 4), in which a person's preferences are expressed. We would rather match sentences such as "I like to play soccer in the evening", implying an action that takes place in the future. Therefore, in order to improve this pattern, we could require the presence of certain temporal expressions such as 'today' or 'tomorrow'. The second exception is pattern Nr. 7 (story content), where the moderate precision value is most likely due to the low number of matches. In the case of the non-story content, this pattern yields high precision values, demonstrating its usefulness.

Table 4: True and false positives of extracted human goals (patterns Nr. 5-8).

Matched Instance          | Context                               | Goal
wanted/VBD to/TO go/VB    | I never wanted to go back to school   | yes
wants/VBZ to/TO do/VB     | he wants to do it                     | no
intend/VBP to/TO get/VB   | I intend to get up at 7:30            | yes
intend/VB to/TO stay/VB   | Jean, do you intend to stay here until